{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
    "        <br>\n",
    "        <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
    "    </p>\n",
    "</center>\n",
    "<h1 align=\"center\">Tracing and Evaluating a LlamaIndex Application</h1>\n",
    "\n",
    "LlamaIndex provides high-level APIs that enable users to build powerful applications in a few lines of code. However, it can be challenging to understand what is going on under the hood and to pinpoint the cause of issues. Phoenix makes your LLM applications *observable* by visualizing the underlying structure of each call to your query engine and surfacing problematic `spans`` of execution based on latency, token count, or other evaluation metrics.\n",
    "\n",
    "In this tutorial, you will:\n",
    "- Build a simple query engine using LlamaIndex that uses retrieval-augmented generation to answer questions over the Arize documentation,\n",
    "- Record trace data in [OpenInference tracing](https://github.com/Arize-ai/openinference) format using the global `arize_phoenix` handler\n",
    "- Inspect the traces and spans of your application to identify sources of latency and cost,\n",
    "- Export your trace data as a pandas dataframe and run an [LLM Evals](https://docs.arize.com/phoenix/concepts/llm-evals) to measure the precision@k of the query engine's retrieval step.\n",
    "\n",
    "ℹ️ This notebook requires an OpenAI API key."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install Dependencies and Import Libraries\n",
    "\n",
    "Install Phoenix, LlamaIndex, and OpenAI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"arize-phoenix[evals]\" \"openai>=1\" gcsfs nest-asyncio \"llama-index>=0.10.3\" \"openinference-instrumentation-llama-index>=1.0.0\" \"llama-index-callbacks-arize-phoenix>=0.1.2\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "from getpass import getpass\n",
    "from urllib.request import urlopen\n",
    "\n",
    "import nest_asyncio\n",
    "import openai\n",
    "import pandas as pd\n",
    "import phoenix as px\n",
    "from gcsfs import GCSFileSystem\n",
    "from llama_index.core import (\n",
    "    Settings,\n",
    "    StorageContext,\n",
    "    load_index_from_storage,\n",
    "    set_global_handler,\n",
    ")\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from phoenix.evals import (\n",
    "    HallucinationEvaluator,\n",
    "    OpenAIModel,\n",
    "    QAEvaluator,\n",
    "    RelevanceEvaluator,\n",
    "    run_evals,\n",
    ")\n",
    "from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents\n",
    "from phoenix.trace import DocumentEvaluations, SpanEvaluations\n",
    "from tqdm import tqdm\n",
    "\n",
    "nest_asyncio.apply()  # needed for concurrent evals in notebook environments\n",
    "pd.set_option(\"display.max_colwidth\", 1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Launch Phoenix\n",
    "\n",
    "You can run Phoenix in the background to collect trace data emitted by any LlamaIndex application that has been instrumented with the `OpenInferenceTraceCallbackHandler`. Phoenix supports LlamaIndex's [one-click observability](https://gpt-index.readthedocs.io/en/latest/end_to_end_tutorials/one_click_observability.html) which will automatically instrument your LlamaIndex application! You can consult our [integration guide](https://docs.arize.com/phoenix/integrations/llamaindex) for a more detailed explanation of how to instrument your LlamaIndex application.\n",
    "\n",
    "Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because we have yet to run the LlamaIndex application)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "session = px.launch_app()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Configure Your OpenAI API Key\n",
    "\n",
    "Set your OpenAI API key if it is not already set as an environment variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
    "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
    "openai.api_key = openai_api_key\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Build Your LlamaIndex Application"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example uses a `RetrieverQueryEngine` over a pre-built index of the Arize documentation, but you can use whatever LlamaIndex application you like.\n",
    "\n",
    "Download our pre-built index of the Arize docs from cloud storage and instantiate your storage context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_system = GCSFileSystem(project=\"public-assets-275721\")\n",
    "index_path = \"arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/\"\n",
    "storage_context = StorageContext.from_defaults(\n",
    "    fs=file_system,\n",
    "    persist_dir=index_path,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Enable Phoenix tracing within LlamaIndex by setting `arize_phoenix` as the global handler. This will mount Phoenix's [OpenInferenceTraceCallback](https://docs.arize.com/phoenix/integrations/llamaindex) as the global handler. Phoenix uses OpenInference traces - an open-source standard for capturing and storing LLM application traces that enables LLM applications to seamlessly integrate with LLM observability solutions such as Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set_global_handler(\"arize_phoenix\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to instantiate our query engine that will perform retrieval-augmented generation (RAG). Query engine is a generic interface in LlamaIndex that allows you to ask question over your data. A query engine takes in a natural language query, and returns a rich response. It is built on top of Retrievers. You can compose multiple query engines to achieve more advanced capability  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Settings.llm = OpenAI(model=\"gpt-4-turbo-preview\")\n",
    "Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-ada-002\")\n",
    "index = load_index_from_storage(\n",
    "    storage_context,\n",
    ")\n",
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Run Your Query Engine and View Your Traces in Phoenix\n",
    "\n",
    "We've compiled a list of commonly asked questions about Arize. Let's download the sample queries and take a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "queries_url = \"http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/arize_docs_queries.jsonl\"\n",
    "queries = []\n",
    "with urlopen(queries_url) as response:\n",
    "    for line in response:\n",
    "        line = line.decode(\"utf-8\").strip()\n",
    "        data = json.loads(line)\n",
    "        queries.append(data[\"query\"])\n",
    "queries[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run the first 10 queries and view the traces in Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for query in tqdm(queries[:10]):\n",
    "    query_engine.query(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And just for fun, ask your own question!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"What is Arize and how can it help me as an AI Engineer?\")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the Phoenix UI as your queries run. Your traces should appear in real time.\n",
    "\n",
    "Open the Phoenix UI with the link below if you haven't already and click through the queries to better understand how the query engine is performing. For each trace you will see a break\n",
    "\n",
    "Phoenix can be used to understand and troubleshoot your by surfacing:\n",
    " - **Application latency** - highlighting slow invocations of LLMs, Retrievers, etc.\n",
    " - **Token Usage** - Displays the breakdown of token usage with LLMs to surface up your most expensive LLM calls\n",
    " - **Runtime Exceptions** - Critical runtime exceptions such as rate-limiting are captured as exception events.\n",
    " - **Retrieved Documents** - view all the documents retrieved during a retriever call and the score and order in which they were returned\n",
    " - **Embeddings** - view the embedding text used for retrieval and the underlying embedding model\n",
    "LLM Parameters - view the parameters used when calling out to an LLM to debug things like temperature and the system prompts\n",
    " - **Prompt Templates** - Figure out what prompt template is used during the prompting step and what variables were used.\n",
    " - **Tool Descriptions** - view the description and function signature of the tools your LLM has been given access to\n",
    " - **LLM Function Calls** - if using OpenAI or other a model with function calls, you can view the function selection and function messages in the input messages to the LLM.\n",
    "\n",
    "<img src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/RAG_trace_details.png\" alt=\"Trace Details View on Phoenix\" style=\"width:100%; height:auto;\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"🚀 Open the Phoenix UI if you haven't already: {session.url}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Export and Evaluate Your Trace Data\n",
    "\n",
    "You can export your trace data as a pandas dataframe for further analysis and evaluation.\n",
    "\n",
    "In this case, we will export our `retriever` spans into two separate dataframes:\n",
    "- `queries_df`, in which the retrieved documents for each query are concatenated into a single column,\n",
    "- `retrieved_documents_df`, in which each retrieved document is \"exploded\" into its own row to enable the evaluation of each query-document pair in isolation.\n",
    "\n",
    "This will enable us to compute multiple kinds of evaluations, including:\n",
    "- relevance: Are the retrieved documents grounded in the response?\n",
    "- Q&A correctness: Are your application's responses grounded in the retrieved context?\n",
    "- hallucinations: Is your application making up false information?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "queries_df = get_qa_with_reference(px.Client())\n",
    "retrieved_documents_df = get_retrieved_documents(px.Client())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, define your evaluation model and your evaluators.\n",
    "\n",
    "Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_model = OpenAIModel(\n",
    "    model=\"gpt-4-turbo-preview\",\n",
    ")\n",
    "hallucination_evaluator = HallucinationEvaluator(eval_model)\n",
    "qa_correctness_evaluator = QAEvaluator(eval_model)\n",
    "relevance_evaluator = RelevanceEvaluator(eval_model)\n",
    "\n",
    "hallucination_eval_df, qa_correctness_eval_df = run_evals(\n",
    "    dataframe=queries_df,\n",
    "    evaluators=[hallucination_evaluator, qa_correctness_evaluator],\n",
    "    provide_explanation=True,\n",
    ")\n",
    "relevance_eval_df = run_evals(\n",
    "    dataframe=retrieved_documents_df,\n",
    "    evaluators=[relevance_evaluator],\n",
    "    provide_explanation=True,\n",
    ")[0]\n",
    "\n",
    "px.Client().log_evaluations(\n",
    "    SpanEvaluations(eval_name=\"Hallucination\", dataframe=hallucination_eval_df),\n",
    "    SpanEvaluations(eval_name=\"QA Correctness\", dataframe=qa_correctness_eval_df),\n",
    "    DocumentEvaluations(eval_name=\"Relevance\", dataframe=relevance_eval_df),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your evaluations should now appear as annotations on the appropriate spans in Phoenix.\n",
    "\n",
    "![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"🚀 Open the Phoenix UI if you haven't already: {session.url}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Final Thoughts\n",
    "\n",
    "LLM Traces and the accompanying OpenInference Tracing specification is designed to be a category of telemetry data that is used to understand the execution of LLMs and the surrounding application context such as retrieval from vector stores and the usage of external tools such as search engines or APIs. It lets you understand the inner workings of the individual steps your application takes wile also giving you visibility into how your system is running and performing as a whole. \n",
    "\n",
    "LLM Evals are designed for simple, fast, and accurate LLM-based evaluations. They let you quickly benchmark the performance of your LLM application and help you identify the problematic spans of execution.\n",
    "\n",
    "For more details on Phoenix, LLM Tracing, and LLM Evals, checkout our [documentation](https://docs.arize.com/phoenix/)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
